unimodal bias



RUBi: Reducing Unimodal Biases for Visual Question Answering

Neural Information Processing Systems

Visual Question Answering (VQA) is the task of answering questions about an image. VQA models often exploit unimodal biases to provide the correct answer without using the image information. As a result, they suffer a large drop in performance when evaluated on data outside their training-set distribution. This critical issue makes them unsuitable for real-world settings. We propose RUBi, a new learning strategy to reduce biases in any VQA model. It reduces the importance of the most biased examples, i.e., examples that can be correctly classified without looking at the image.
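The masking idea described in this abstract can be illustrated with a minimal numpy sketch. The function name, the use of a sigmoid mask on the fused logits, and the simple sum of the two losses below are simplifying assumptions on my part (the paper specifies the exact formulation, including which gradients are stopped); the point is only that an example the question-only branch already answers confidently produces a smaller main loss, and hence a smaller gradient for the main model.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rubi_loss(fused_logits, q_only_logits, labels):
    """Sketch of a RUBi-style objective (names and composition illustrative).
    The sigmoid of the question-only logits masks the fused logits: when the
    question alone already points to the correct answer, the masked prediction
    is confident too, so the cross-entropy on that biased example shrinks."""
    mask = 1.0 / (1.0 + np.exp(-q_only_logits))        # sigmoid mask
    p_masked = softmax(fused_logits * mask)            # modulated main prediction
    p_q = softmax(q_only_logits)                       # question-only prediction
    n = len(labels)
    loss_main = -np.log(p_masked[np.arange(n), labels]).mean()
    loss_q = -np.log(p_q[np.arange(n), labels]).mean() # trains the q-only branch
    return loss_main, loss_q
```

For a three-class toy example with correct answer 0, a strongly question-biased branch (logits `[4, -4, -4]`) yields a lower main loss than an uninformative one (logits `[0, 0, 0]`), which is the down-weighting of biased examples in action.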



Reviews: RUBi: Reducing Unimodal Biases for Visual Question Answering

Neural Information Processing Systems

Originality: The proposed method is a novel dynamic loss re-weighting technique applied to VQA under the changing-priors condition (VQA-CP), where the train and test sets are deliberately constructed to have different distributions. The related works are adequately cited and discussed. While prior works have also focused on using knowledge from a question-only model to capture unnecessary biases in the dataset [25], the paper differs from [25] in some key aspects. E.g., the proposed method guides the whole model (including the visual encoding branch) to better learn "harder" examples, whereas [25] focuses only on reducing bias from the question encoding. Quality: The proposed method is sound and well-motivated.


Reviews: RUBi: Reducing Unimodal Biases for Visual Question Answering

Neural Information Processing Systems

After the authors' rebuttal, all reviewers believe the paper makes a significant enough contribution to be accepted to the conference. When large amounts of data must be collected for complex tasks such as VQA, bias in the labeling process is highly likely. Techniques that improve robustness to such biases can have a significant impact in these cases. The authors should incorporate the clarifications and results from the rebuttal into the paper and address the reviewers' comments.



Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective

Chen, Meiqi, Cao, Yixin, Zhang, Yan, Lu, Chaochao

arXiv.org Artificial Intelligence

Recent advancements in Large Language Models (LLMs) have facilitated the development of Multimodal LLMs (MLLMs). Despite their impressive capabilities, MLLMs often suffer from an over-reliance on unimodal biases (e.g., language bias and vision bias), leading to incorrect answers in complex multimodal tasks. To investigate this issue, we propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. Within our framework, we devise a causal graph to elucidate the predictions of MLLMs on VQA problems, and assess the causal effect of biases through an in-depth causal analysis. Motivated by the causal graph, we introduce a novel MORE dataset, consisting of 12,000 VQA instances. This dataset is designed to challenge MLLMs' abilities, necessitating multi-hop reasoning and the surmounting of unimodal biases. Furthermore, we propose two strategies to mitigate unimodal biases and enhance MLLMs' reasoning capabilities, including a Decompose-Verify-Answer (DeVA) framework for limited-access MLLMs and the refinement of open-source MLLMs through fine-tuning. Extensive quantitative and qualitative experiments offer valuable insights for future research. Our project page is at https://opencausalab.github.io/MORE.
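The Decompose-Verify-Answer idea for limited-access MLLMs can be sketched as a simple control loop. Everything below is a hypothetical illustration: the prompt strings, the `ask` callable, and the verify-then-reanswer flow are my assumptions, not the authors' implementation; the sketch only shows how verifying each sub-answer against the image can catch a unimodal (language-prior) shortcut before answers are composed.

```python
def deva(question, ask):
    """Hypothetical Decompose-Verify-Answer loop.
    ask(prompt) -> str is any black-box (M)LLM call; the image is assumed
    to be attached to every call by the caller."""
    # 1. Decompose the multi-hop question into simpler sub-questions.
    subs = [s.strip() for s in ask(f"DECOMPOSE: {question}").split(";")]
    answered = []
    for sub in subs:
        # 2. Answer each sub-question, then verify the answer against the
        #    image to catch language-prior shortcuts.
        cand = ask(f"ANSWER: {sub}")
        verdict = ask(f"VERIFY against the image: {sub} -> {cand}")
        if verdict.lower().startswith("no"):
            cand = ask(f"REANSWER using only the image: {sub}")
        answered.append(f"{sub}: {cand}")
    # 3. Compose the verified sub-answers into the final answer.
    return ask("COMPOSE final answer from: " + "; ".join(answered))
```

With a scripted stub standing in for the model, one can trace the loop end to end: a sub-answer that contradicts the image ("bones") is rejected at the verify step and replaced by an image-grounded one ("fish") before composition.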


A Theory of Unimodal Bias in Multimodal Learning

Zhang, Yedi, Latham, Peter E., Saxe, Andrew

arXiv.org Artificial Intelligence

Using multiple input streams simultaneously in training multimodal neural networks is intuitively advantageous, but practically challenging. A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training. While unimodal bias is well-documented empirically, our theoretical understanding of how architecture and data statistics influence this bias remains incomplete. Here we develop a theory of unimodal bias with deep multimodal linear networks. We calculate the duration of the unimodal phase in learning as a function of the depth at which modalities are fused within the network, dataset statistics, and initialization. We find that the deeper the layer at which fusion occurs, the longer the unimodal phase. A long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparametrized regime. In addition, our theory reveals the modality learned first is not necessarily the modality that contributes more to the output. Our results, derived for multimodal linear networks, extend to ReLU networks in certain settings. Taken together, this work illuminates pathologies of multimodal learning under joint training, showing that late and intermediate fusion architectures can give rise to long unimodal phases and permanent unimodal bias.
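The fusion-depth effect lends itself to a toy illustration. The numpy sketch below is my own simplified construction, not the paper's exact setting: each modality is a scalar passed through its own chain of `depth` scalar weights, branch outputs are summed, and the initial end-to-end gain is matched across depths. With deeper chains, the multiplicative dynamics keep the weaker modality's gain on a plateau far longer, a toy analogue of the unimodal phase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two modalities, both predictive of y, but modality A carries the
# stronger signal (coefficients 1.0 vs 0.3).
n = 512
x_a = rng.normal(size=n)
x_b = rng.normal(size=n)
y = 1.0 * x_a + 0.3 * x_b

def train(depth, init, steps=4000, lr=0.05):
    """Late-fusion deep linear net: each modality gets its own chain of
    `depth` scalar weights (all set to `init`); branch outputs are summed.
    Returns the end-to-end gain of each branch at every step."""
    w_a = np.full(depth, init)
    w_b = np.full(depth, init)
    gains = np.zeros((steps, 2))
    for t in range(steps):
        ga, gb = np.prod(w_a), np.prod(w_b)
        gains[t] = ga, gb
        err = ga * x_a + gb * x_b - y
        grad_ga = np.mean(err * x_a)   # d loss / d (end-to-end gain)
        grad_gb = np.mean(err * x_b)
        for i in range(depth):         # chain rule through the weight product
            w_a[i] -= lr * grad_ga * np.prod(np.delete(w_a, i))
            w_b[i] -= lr * grad_gb * np.prod(np.delete(w_b, i))
    return gains

# Match the initial end-to-end gain (0.2**3) across depths.
shallow = train(depth=1, init=0.2 ** 3)
deep = train(depth=3, init=0.2)

def first_cross(g, thr):
    """First step at which a branch's gain exceeds thr (None if never)."""
    idx = np.nonzero(g > thr)[0]
    return int(idx[0]) if idx.size else None
```

In this toy run the shallow net picks up the weak modality within tens of steps, while the depth-3 net leaves it near zero for hundreds of steps before both branches eventually reach their optimal gains.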


Synthetic Misinformers: Generating and Combating Multimodal Misinformation

Papadopoulos, Stefanos-Iordanis, Koutlis, Christos, Papadopoulos, Symeon, Petrantonakis, Panagiotis C.

arXiv.org Artificial Intelligence

With the expansion of social media and the increasing dissemination of multimedia content, the spread of misinformation has become a major concern. This necessitates effective strategies for multimodal misinformation detection (MMD) that detect whether the combination of an image and its accompanying text could mislead or misinform. Due to the data-intensive nature of deep neural networks and the labor-intensive process of manual annotation, researchers have been exploring various methods for automatically generating synthetic multimodal misinformation - which we refer to as Synthetic Misinformers - in order to train MMD models. However, limited evaluation on real-world misinformation and a lack of comparisons with other Synthetic Misinformers make it difficult to assess progress in the field. To address this, we perform a comparative study on existing and new Synthetic Misinformers that involves (1) out-of-context (OOC) image-caption pairs, (2) cross-modal named entity inconsistency (NEI) as well as (3) hybrid approaches, and we evaluate them against real-world misinformation using the COSMOS benchmark. The comparative study showed that our proposed CLIP-based Named Entity Swapping can lead to MMD models that surpass other OOC and NEI Misinformers in terms of multimodal accuracy, and that hybrid approaches can lead to even higher detection accuracy. Nevertheless, after alleviating information leakage from the COSMOS evaluation protocol, low Sensitivity scores indicate that the task is significantly more challenging than previous studies suggested. Finally, our findings showed that NEI-based Synthetic Misinformers tend to suffer from a unimodal bias, where text-only MMDs can outperform multimodal ones.


RUBi: Reducing Unimodal Biases for Visual Question Answering

Cadene, Remi, Dancette, Corentin, Ben-younes, Hedi, Cord, Matthieu, Parikh, Devi

Neural Information Processing Systems

Visual Question Answering (VQA) is the task of answering questions about an image. VQA models often exploit unimodal biases to provide the correct answer without using the image information. As a result, they suffer a large drop in performance when evaluated on data outside their training-set distribution. This critical issue makes them unsuitable for real-world settings. We propose RUBi, a new learning strategy to reduce biases in any VQA model. It reduces the importance of the most biased examples, i.e., examples that can be correctly classified without looking at the image.